Abstract

Complex networks have recently become the focus of research in many fields. Their structure reveals crucial information about
the nodes and how they connect and share information. In our work we analyze protein interaction networks as complex
networks for their functional modular structure and later use that information in the functional annotation of proteins
within the network. We propose several graph representations for the protein interaction network, each with a different
level of complexity and a different degree of inclusion of the annotation information within the graph. We aim to explore the
benefits and drawbacks of these proposed graphs when they are used in the function prediction process via clustering methods.
To make this cluster-based prediction, we adopt well-established approaches for cluster detection in complex networks,
using recent representative algorithms that have been shown to be efficient for the task at hand. The experiments are
performed using a purified and reliable Saccharomyces cerevisiae protein interaction network, which is then used to
generate the different graph representations. Each of the graph representations is then analysed in combination with each
of the clustering algorithms, which were modified where necessary and implemented to fit the specific graph. We evaluate the
results with regard to biological validity and function prediction performance. Our results indicate that the novel ways of
representing the complex graph improve the prediction process, although the computational complexity should be taken
into account when deciding on a particular approach.
Introduction

A protein within a cell is rarely a single constituent of the 
mechanism that performs its function. It has been observed that 
proteins involved in the same cellular processes often interact with 
each other [1] making the protein-protein interactions (PPI) 
fundamental to almost all biological processes [2]. A significant
amount of data is produced with the advancement of high-throughput
technologies. Yeast two-hybrid, mass spectrometry,
and protein chip technologies have allowed the construction of
large interaction networks [3], and are now scaled up to produce
extensive genome-wide data sets that are providing us with a first
glimpse of global interaction networks. However, these rapid
improvements come at a price: the vast majority of known
proteins have not been experimentally characterized, and their
function is still unknown [4]. As has been commonly realized,
the acquisition of data is but a preliminary step, and the true
challenge lies in developing effective means to analyze such data
and endow them with physical and/or functional meaning [5].
This has made computational function prediction one
of the most challenging problems of the postgenomic era.

PPI data has the nature of networks. This provides a global view 
of the context of each protein. There is more information in a 
protein interaction network (PIN) compared to sequence or 
structure alone. A protein in a PIN is annotated with one or 
more functional terms. Multiple and sometimes unrelated 
annotations can occur due to multiple active binding sites or
possibly multiple stable tertiary conformations of a protein. The
annotation terms are commonly based on an ontology. A major
effort in this direction is the Gene Ontology (GO) project [6]. GO
characterizes proteins in three major aspects: molecular function, 
biological process and cellular localization. 

We can now characterize the computational function prediction 
as the process of understanding the relationship between the 
protein's interaction context and its functions. Grouping proteins 
of the PIN into sets (clusters) which show greater similarity among 
proteins in the same cluster than in different clusters has been 
shown to be an effective approach to accomplish this goal [7]. Since
biological functions can be carried out by particular groups of
proteins, dividing networks into naturally grouped parts (clusters) is
an essential way to investigate relationships between the
function and topology of networks or to reveal hidden knowledge
behind them. Typical graph clustering methods often result in a
poor clustering arrangement [8], so PINs have been weighted
based on topological properties such as shortest path length [9,10]
and clustering coefficients [11] in order to improve
the clustering results. In [12-15] edge betweenness
and its modified version, using weights generated from microarray
expression profiles, have been used as a method to find functional
modules in the PIN. A method that combines the results of 
multiple, independent clustering arrangements into a single 
consensus cluster structure is presented in [16]. 

PINs have also been analyzed by extracting protein complexes, 
i.e. finding densely connected subgraphs within the network. To
infer such complexes many methods have been proposed. The
Markov Cluster algorithm (MCL) [17] simulates a flow on the
graph by calculating successive powers of the associated adjacency
matrix. Restricted Neighborhood Search Clustering (RNSC) [18]
is a cost-based local search algorithm that explores the solution
space to minimize a cost function, calculated according to the
numbers of intra-cluster and inter-cluster edges. Super Paramagnetic
Clustering (SPC) [19] is a hierarchical clustering algorithm
inspired by an analogy with the physical properties of a
ferromagnetic model subject to fluctuation at nonzero temperature.
Molecular Complex Detection (MCODE) [20] is based on
node weighting by local neighborhood density and outward
traversal from a locally dense seed protein to isolate densely
connected regions. Detection of highly connected subgraphs
(cliques) combined with Monte Carlo optimization is considered
in [21]. The authors distinguish two types of clusters: protein
complexes and dynamic functional modules. The highly connected
subgraphs algorithm is used in [22] for the discovery of protein
complexes, while the authors of [23] use spectral clustering for
generating modules, and possible functional relationships among
the members of a cluster for predicting new protein-protein
connections. More recent approaches exploit semantic similarity
measures based on GO between pairs of proteins within the PIN. 
PROCOMOSS [24] uses a multi-objective evolutionary approach 
in which graphical properties as well as biological properties based 
on GO semantic similarity measure are considered as objective 
functions for detecting protein complexes in a PIN. CSO [25] 
performs clustering based on network structure and ontology 
attribute similarity on GO attributed PINs. Both of these
algorithms achieve state-of-the-art performance. These results
provide further evidence that topological features of the PIN alone are
insufficient for proper partitioning of the PIN and that the network
needs to be augmented.

In this paper we address the problem of function prediction in
a twofold manner. First, we propose novel graph representations of
the PIN, each with a different level of complexity and a different
degree of inclusion of the annotation information within the graph. Second,
we select state-of-the-art algorithms for cluster detection that have
not yet been used on PINs and we examine their efficiency in
detecting clusters within the different graph representations of the
PIN as previously defined. Since we are interested in function
prediction, the exploration of these methods goes one step further:
we assess clustering quality in terms of accurate cluster-based
function prediction and establish the benefits and drawbacks
of combining the methods with the different graph
representations of the PIN in the functional annotation process.
We conclude the paper with a discussion of the
recommended approach to predicting function in the PIN
depending on the priorities of the outcome, i.e. what the best
experimental setup is if the prediction is done network-wide versus
for a single protein (or a small group of proteins), and if
prediction accuracy is of higher importance than coverage, or
vice versa.

Materials and Methods 

Protein-Protein Interaction Data 

High-throughput techniques are prone to detecting many false
positive interactions, which introduces considerable noise in the form of
non-existent interactions in the databases. Furthermore, some of the databases
are supplemented with interactions derived computationally by
protein interaction prediction methods, adding further noise
to the databases. Therefore, none of the available databases is
perfectly reliable and the choice of a suitable database should be
made very carefully.

We conduct our experiments on Saccharomyces cerevisiae PPI
data compiled from a number of established datasets
used in previous research on PPI. Namely, we first merge the PPI
datasets of Uetz [26], Ito [27], Ho [28], Krogan [29], and Gavin
[30]. We then filter out interactions from the merged dataset based
on the amount of supporting evidence found in DIP [31], MIPS
[32], MINT [33], BIND [34] and BioGRID [35]. The resulting
dataset contains only protein-protein interactions that are supported by
more than one piece of experimental evidence. The functional terms for
each protein are taken from the SGD database [36], and are
unified with the GO terminology. This data is further purified as
proposed in [37]. First, the trivial functional terms, like 'unknown
molecular function', are erased. Then, additional terms are
calculated for each protein by the policy of transitive closure
derived from the GO. The extremely frequent terms (appearing as
annotations to more than 300 proteins) are also excluded, because
they are very general and do not carry significant information.
The final dataset is highly reliable and consists of 2502 proteins
with 6354 interactions between them and has a total of 888
functional terms and 31515 protein-term pairs. The average node
degree of the resulting protein interaction network is 5.08 and the
clustering coefficient is 0.18. Figure 1 shows the degree distribution
of the network on a log-log scale.
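
As a quick check of the figures above, the average degree of an undirected graph is 2|E|/|V|, which gives 2 x 6354 / 2502, or about 5.08. The term filtering step can be sketched as follows. This is only a minimal illustration: the function names `purify_annotations` and `average_degree`, the dictionary layout, and the toy data are assumptions made here, not part of the original pipeline, and the transitive closure step is not shown.

```python
from collections import Counter

def purify_annotations(annotations, trivial_terms, max_proteins=300):
    """Drop trivial terms and terms annotating more than `max_proteins` proteins.

    `annotations` maps protein id -> set of terms (hypothetical layout).
    """
    # Count how many proteins carry each term.
    counts = Counter(t for terms in annotations.values() for t in terms)
    too_frequent = {t for t, c in counts.items() if c > max_proteins}
    banned = too_frequent | set(trivial_terms)
    return {p: set(terms) - banned for p, terms in annotations.items()}

def average_degree(num_nodes, num_edges):
    """Average node degree of an undirected graph: 2|E| / |V|."""
    return 2 * num_edges / num_nodes

# Toy data; the threshold is lowered to 2 so the filter has a visible effect.
toy = {
    "P1": {"a", "b"},
    "P2": {"a", "unknown molecular function"},
    "P3": {"a", "c"},
}
cleaned = purify_annotations(toy, {"unknown molecular function"}, max_proteins=2)
print(cleaned)
print(round(average_degree(2502, 6354), 2))
```

In the toy run the term "a" is removed because it annotates three proteins, which exceeds the lowered threshold of two.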

Protein Interaction Network Representation 

As previously stated, PPI data has the properties of a network 
and therefore can be represented as a graph. We introduce several 
different graph representations of the PIN, each of which 
represents the information within the data at a different level. 
Our first goal is to explore the level of detail that is sufficient for 
effective clustering of the PIN and function prediction, and to 
show that the novel augmented representations significantly 
improve performance. We point out here that all graphs resulting 
from a PIN are undirected since an interaction itself is undirected. 
The different representations with ascending level of complexity 
are defined as follows. 

Simple Graphs. The most basic PIN graph
representation is a simple graph $G_1 = (V, E)$, where
nodes $i, j \in V$ correspond to proteins, and edges $(i, j) \in E$ correspond
to interactions between proteins $i$ and $j$. The simple graph
is unweighted. With this graph we use only the topology of the PIN
to determine clusters. For our data we have $|V| = 2502$ and
$|E| = 6354$.

Weighted Graphs. The simplest way to enrich the previous
representation is to add weights to the edges from $E$ and thus define a
weighted graph $G_2 = (V, E, W)$ for the PIN, where $W$ is a matrix
whose elements $w_{ij}$ are the weights of the edges $(i, j) \in E$. Weights
can be calculated in three different ways [38].

a) Content-based weights: a content-based weight calculation is
one that assigns weight $w_{ij}^{1}$ to the edge $(i, j)$ by looking at the
terms ("content") associated with nodes $i$ and $j$, not taking
their environment (the graph structure) into account. If $t_i$ is
the set of terms associated with node $i$ and $t_j$ is the set of terms
associated with $j$, $w_{ij}^{1}$ can be computed using the normalized
Jaccard Index as follows:

$$ w_{ij}^{1} = \frac{1}{2}\left(\frac{|t_i \cap t_j|}{|t_i|} + \frac{|t_i \cap t_j|}{|t_j|}\right) \tag{1} $$
Figure 1. Degree distribution of the primary PIN on log-log scale.
b) Structure-based weights: a structure-based weight calculation
is one that takes the context of the nodes $i$ and $j$ into
account, but not the content of the nodes themselves, when
calculating weight $w_{ij}^{2}$ for the edge $(i, j)$. In order to calculate
$w_{ij}^{2}$ we need to derive a way to map the context of $i$ and $j$ so
that the result contains all the structural information about
these nodes. The structural information of the graph $G_2$ is
naturally encoded in its adjacency matrix $A = [a_{ij}]$ so we can
define the weight matrix $W^{2} = [w_{ij}^{2}]$ as follows:

$$ W^{2} = W^{1} A + A W^{1} \tag{2} $$

where $W^{1} = [w_{ij}^{1}]$ is the content-based weight matrix. Since
$a_{ij} = 0$, $\forall (i, j) \notin E$, the first part of Eq. 2 gives the sum of all
content-based weights of edges between node $i$ and all
neighbours of $j$, while the second part is the sum of all
content-based weights between node $j$ and all neighbours of $i$.
PINs are known to have proteins that interact with many
others, which gives rise to hubs within the graph representing
the PIN. Eq. 2 will give high scores to nodes with high degree
and low scores to nodes with low degree, so we
average the values to overcome this unwanted effect and get
Eq. 3. Additionally, the $w_{ij}^{2}$ are normalized to be in the same
range as the $w_{ij}^{1}$.

$$ W^{2} = \frac{1}{2}\left(W^{1} A_{1} + A_{2} W^{1}\right) \tag{3} $$

where $A_{1} = \left[a_{ij} / \sum_{k=1}^{N} a_{kj}\right]$, $A_{2} = \left[a_{ij} / \sum_{k=1}^{N} a_{ik}\right]$, and $N = |V|$.

c) Hybrid weights: this scheme combines both content-based and
structure-based weights; a natural way of combining them
is taking the average of the two:

$$ W = \frac{1}{2}\left(W^{1} + W^{2}\right) \tag{4} $$

We note that many other ways of defining $W^{1}$ and $W^{2}$ are
possible. Multiple definitions of
weighting may make sense, and, depending on the task,
one may be more suitable than another. We will show how
the different weighting schemes influence the result of
clustering and function prediction.
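
The three weighting schemes can be sketched in a few lines of plain Python. This is a hedged illustration, not the authors' implementation: it assumes that the content-based weight is the average of the two overlap ratios |t_i ∩ t_j|/|t_i| and |t_i ∩ t_j|/|t_j|, that the averaged structure-based weight carries a factor 1/2, and that weights are only stored for existing edges. The function names and the toy triangle are hypothetical.

```python
def content_weight(ti, tj):
    """Average of the two overlap ratios of term sets ti and tj (assumed form of Eq. 1)."""
    if not ti or not tj:
        return 0.0
    shared = len(ti & tj)
    return 0.5 * (shared / len(ti) + shared / len(tj))

def weight_matrices(n, edges, terms):
    """Return (W1, W2, W) as dense lists for an undirected graph on nodes 0..n-1."""
    A = [[0.0] * n for _ in range(n)]
    W1 = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
        W1[i][j] = W1[j][i] = content_weight(terms[i], terms[j])
    deg = [sum(row) for row in A]
    W2 = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        # Mean content weight between i and the neighbours of j (the W1 * A1 term).
        first = sum(W1[i][k] * A[k][j] for k in range(n)) / deg[j]
        # Mean content weight between j and the neighbours of i (the A2 * W1 term).
        second = sum(A[i][k] * W1[k][j] for k in range(n)) / deg[i]
        W2[i][j] = W2[j][i] = 0.5 * (first + second)
    # Hybrid weights: plain average of content-based and structure-based weights.
    W = [[0.5 * (W1[i][j] + W2[i][j]) for j in range(n)] for i in range(n)]
    return W1, W2, W

# Toy triangle with hypothetical annotations.
terms = {0: {"a", "b"}, 1: {"a"}, 2: {"a", "c"}}
W1, W2, W = weight_matrices(3, [(0, 1), (1, 2), (0, 2)], terms)
print(W1[0][1], W2[0][1], W[0][1])
```

For the edge (0, 1) the content-based weight is 0.75, while the structure-based weight only sees the third node of the triangle and comes out lower, so the hybrid weight lies between the two.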

Protein-Term Graphs. We define $G_3 = (V \cup T, E \cup E_t)$ as a
protein-term graph in which the terms associated with proteins in
the PIN become part of its representation. More specifically, $T$ is
the set of all terms present within the PIN and each term $t_j$ is
represented as a node in the graph. $E_t$ is the set of edges $(i, t_j)$
where $i \in V$, $t_j \in T$ and term $t_j$ is associated with protein $i$ in the PIN.
Under this definition the set of additional edges
$E_t$ contains only edges between protein nodes
($V$) and the new term nodes ($T$); no edges exist between two term
nodes, as shown in Figure 2. $V$ and $E$ have the same definition as
in the previous representations. The graph is unweighted.

In this way functional relationships between the proteins in the 
PIN are directly included in the graph representation and 
therefore in the process of clustering and function prediction. 
When we create the protein-term graph for our data we have a
total of 3390 nodes ($|V| = 2502$, $|T| = 888$) and 37869 edges
($|E| = 6354$, $|E_t| = 31515$).
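
A minimal sketch of how such a protein-term graph could be assembled from an interaction list and an annotation table is given below. The function name `protein_term_graph`, the node tagging scheme and the toy data (loosely modelled on four proteins and three terms) are assumptions for illustration only.

```python
def protein_term_graph(interactions, annotations):
    """Return (nodes, edges) of the unweighted protein-term graph."""
    proteins = set(annotations)
    for i, j in interactions:
        proteins.update((i, j))
    terms = {t for ts in annotations.values() for t in ts}
    # Tag nodes by type so a protein and a term can never collide by name.
    nodes = {("protein", p) for p in proteins} | {("term", t) for t in terms}
    # Protein-protein edges (the set E).
    edges = {frozenset({("protein", i), ("protein", j)}) for i, j in interactions}
    # Protein-term edges (the set E_t); no term-term edges are ever created.
    edges |= {frozenset({("protein", p), ("term", t)})
              for p, ts in annotations.items() for t in ts}
    return nodes, edges

# Hypothetical toy network: four proteins, three terms.
interactions = [(1, 2), (2, 3), (3, 4)]
annotations = {1: {"f1"}, 2: {"f1", "f2"}, 3: {"f2"}, 4: {"f3"}}
nodes, edges = protein_term_graph(interactions, annotations)
print(len(nodes), len(edges))

# The sizes reported for the real data follow the same bookkeeping.
print(2502 + 888, 6354 + 31515)
```

The node count is |V| + |T| and the edge count is |E| + |E_t|, which is how the totals of 3390 nodes and 37869 edges arise.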

Figure 2. Protein-term graph. Terms associated with proteins and their connections are added to the graph. Gray nodes 1, 2, 3 and 4 are proteins and
red nodes f1, f2 and f3 are terms. If a node is annotated with one or more terms, links to these term nodes are added (red links).
doi:10.1371/journal.pone.0099755.g002

Full Functional Connected Graphs. The full functional
connected (FFC) graphs are defined as $G_4 = (V, E \cup E_f, W^{f})$. Let $t_i$
and $t_j$ be the sets of terms associated with nodes $i$ and $j$,
respectively; then for edge $(i, j)$ we have $(i, j) \in E_f$ if and only if
$(i, j) \notin E$ and $t_i \cap t_j \neq \emptyset$. $W^{f} = [w_{ij}^{f}]$ is the weight matrix. In other
words, if two proteins in the PIN share a term, an edge is added in
the graph between them even if they do not interact, thus
creating "false" interactions. However, the information about the
"true" interactions is preserved through the weight matrix.
Namely, each edge is assigned a content-based weight, with an
additional constant being added to edges representing real
interactions. Formally we have:

$$ w_{ij}^{f} = \frac{1}{2}\left(\frac{|t_i \cap t_j|}{|t_i|} + \frac{|t_i \cap t_j|}{|t_j|}\right) + \alpha_{ij} \tag{5} $$

where

$$ \alpha_{ij} = \begin{cases} 1, & \text{if } (i, j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{6} $$

for every $(i, j) \in E \cup E_f$. We take the constant to be 1 since that is
the maximum value of the content-based weight in the case of
identical terms in the two connected nodes. This way we ensure
that each true interaction weight is larger (or equal in the worst
case) than any false interaction weight, while at the same time
allowing the content similarity to have at most the same effect as a
true interaction. The FFC graph for our PIN has a total of
1086948 edges ($|E| = 6354$, $|E_f| = 1080594$).

Clustering Algorithms

The modern science of networks has brought significant
advances to our understanding of complex systems, with the
organization of the vertices in clusters (also referred to as
communities) being one of the most relevant features of the
graphs representing such systems. The problem of detecting
clusters is very hard and not yet satisfactorily solved, and is the
focus of a large interdisciplinary scientific community [39]. PINs
are complex networks, and as such communities (corresponding to
functional modules and complexes) emerge in their graph
representations [10]. In our work we focus on the most recently
developed methods for cluster detection in graphs which have
been classified as most efficient [40]. These algorithms were initially
employed in detecting community structure in different real-life
networks and to our knowledge have not yet been used in
clustering PINs. Taking this into account, our motivation and goal
is to explore how these state-of-the-art algorithms perform when
used in a PIN, and further to explore how the combination with
the different PIN representations affects the function prediction
performance.

Modularity Function Algorithms. One of the biggest
breakthroughs in cluster detection was the Girvan and Newman
modularity function [41]. They propose an equation that
calculates the quality of a given clustering compared to a
corresponding random graph. The randomization of the edges is
done while preserving each node's degree. The modularity function
is defined as:

$$ Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{7} $$

The term $A_{ij}$ has a different meaning for different graph representations.
When we work with unweighted graphs ($G_1$, $G_3$) the term
is the corresponding member of the adjacency matrix ($A_{ij} = a_{ij}$),
while in weighted graphs ($G_2$, $G_4$) the term is the corresponding
member of the weight matrix ($A_{ij} = w_{ij}$) since these graphs are a
simple generalization [42]. Terms $k_i$ and $m$ are defined with
$k_i = \sum_{j} A_{ij}$ and $m = \frac{1}{2} \sum_{i,j} A_{ij}$, and in the case of unweighted
graphs correspond to node degree and total number of edges,
respectively. The probability of an edge existing between nodes $i$
and $j$ if connections are made at random but respecting node
degrees is $k_i k_j / 2m$, $c_i$ denotes the cluster to which node $i$ is
assigned and $\delta(c_i, c_j)$ is the Kronecker delta symbol, where
$\delta(c_i, c_j) = 1$ if $c_i = c_j$ and 0 otherwise. This function gives the
difference between the fraction of edges that fall within clusters and the
expected fraction if edges were distributed at random. A positive value
(at most 1) means that the number of edges within the clusters is greater than the
number expected at random, i.e. the clusters are well defined, whereas
zero or negative values mean that the analysed edges do not
form good clusters.
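
To make the modularity function concrete, the following sketch evaluates the standard Girvan and Newman expression, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j), for a given partition. It is a direct, quadratic-time evaluation intended only as an illustration; the function name `modularity` and the toy graph (two triangles joined by one edge) are assumptions made here. Passing a weight matrix instead of a 0/1 adjacency matrix gives the weighted variant.

```python
def modularity(A, labels):
    """Girvan-Newman modularity of a partition.

    A      : symmetric adjacency (or weight) matrix as a list of lists
    labels : labels[i] is the cluster of node i
    """
    n = len(A)
    k = [sum(row) for row in A]   # node degrees (or strengths)
    two_m = sum(k)                # equals 2m
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m

def adjacency(n, edges):
    """Build a dense symmetric 0/1 adjacency matrix."""
    A = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    return A

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = adjacency(6, edges)
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one cluster
print(q_split, q_single)
```

Splitting the graph into its two triangles gives Q = 5/14, while placing all nodes in a single cluster gives Q = 0, which matches the interpretation of positive values as well-defined clusters.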

The "Fast Community" (FC) [43] community structure
inference algorithm is based on a greedy technique that maximizes
the Girvan and Newman modularity function. The algorithm uses
a hierarchical agglomerative method in which, at the beginning, each
node represents one cluster. Nodes and later clusters are merged
so as to maximize the modularity, exploring the full topology of
the graph. The novelty of this algorithm is the use of data
structures for sparse matrices and max-heaps, which make it
much faster and suitable for the analysis of large graphs.



PLOS ONE | www.plosone.org | June 2014 | Volume 9 | Issue 6 | e99755

Function Prediction in PINs via Clustering



The algorithm proposed by Blondel et al. (BGLL) [44] uses a
different greedy technique, using supervertices to represent
the communities when calculating the modularity. At the start all nodes
are in different clusters, but as each node chooses a new cluster the
clusters are replaced with supervertices. Two supervertices are
connected if there exists an edge between any two nodes from the
two supervertices. Again, at each step the modularity is calculated
from the initial topology. This algorithm approximates the maximum
modularity better than the algorithm of Clauset et al. [43],
but its limitation is in the storage demands.

Multi-Resolution Algorithms. Recently it has been shown
that modularity optimization may fail to identify clusters smaller
than a scale which depends on the total number of links of the
network and on the degree of interconnectedness of the clusters,
even in cases where clusters are unambiguously defined; this
characterizes these methods with a so-called resolution limit [45]. A new
class of methods that deals with this problem is based on multi-scale
quality functions. These quality functions incorporate a
resolution parameter that allows tuning the characteristic size of the
clusters in the optimal partition and aim at uncovering modules at
the true scale of organization of a network, i.e., not at a scale
imposed by modularity optimization. The publication of Lambiotte
[46] gives a good overview of the existing multi-resolution
quality functions and also presents a new method that tries to unify
them by looking into the dynamics of the partitioning problem.
The key idea is to measure the quality of a partition in terms of the stability of
its modules under a stationary Markov process modeled as a
random walk. The resulting quality function for detecting
modules on multiple scales is defined as follows:

$$ Q_t = (1 - t) + \frac{1}{2m} \sum_{i,j} \left[ t A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \tag{8} $$

where $t$ is the time parameter of the random walk; the expression is
equivalent to the Hamiltonian introduced by Reichardt and
Bornholdt [47]. This equation is the same as the modularity
function (7) when the time parameter $t$ is equal to 1. The
algorithm implementation suggested in [46] and [48] uses the
same greedy technique for modularity maximization as in [44].
We performed experiments for the time parameter ranging from 1
to 10 (as suggested in [48]) and we obtained the best results when
the parameter equals 5. We refer to this algorithm with the time
parameter set to 5 as TimeBGLL.
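
The role of the time parameter can be illustrated with a small sketch that evaluates the multi-scale quality function for a fixed partition. It assumes the linearized form Q_t = (1 − t) + (1/2m) Σ_ij [t A_ij − k_i k_j / 2m] δ(c_i, c_j); the function name `stability` and the toy graph are hypothetical, and the sketch only scores a partition, it does not perform the greedy optimization of [44].

```python
def stability(A, labels, t=1.0):
    """Multi-scale quality of a partition for random-walk time t (assumed linearized form)."""
    n = len(A)
    k = [sum(row) for row in A]
    two_m = sum(k)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += t * A[i][j] - k[i] * k[j] / two_m
    return (1.0 - t) + total / two_m

# Two triangles joined by a single edge (toy example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    A[i][j] = A[j][i] = 1.0

labels = [0, 0, 0, 1, 1, 1]
q_t1 = stability(A, labels, t=1.0)  # reduces to ordinary modularity
q_t2 = stability(A, labels, t=2.0)
print(q_t1, q_t2)
```

At t = 1 the value coincides with the ordinary modularity of the same partition (5/14 for this graph), which is the consistency property stated in the text.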

Edge Clustering Algorithms. Partitioning of nodes in a 
graph has the disadvantage of being incompatible with the 
existence of overlapping clusters, i.e. situations where nodes belong 
to several clusters. This overlap is known to be present at the 
interface between clusters, but can also be pervasive in the whole 
graph [49]. In these situations a partition of the nodes is 
questionable as it imposes undesired constraints on the cluster 
detection problem. Since edges in the graphs representing the 
PINs often correspond to one particular type of interaction in the 
PIN, they typically belong to one single cluster. Therefore we 
define clusters as partitions of edges rather than of nodes. The 
edges incident at a single node may belong to several partitions 
and in this sense, nodes can be members of several clusters. 

We adopt the method proposed in [50] since it naturally fits the
problem at hand, and can also deal with weighted graphs as
described in [51]. Without loss of generality we can assume the
definition $G_1(V, E)$ for an unweighted node graph. The method
first transforms $G_1$ into an unweighted line graph $L_1(G_1)$ and then
uses random walk dynamics to measure the quality function. In
principle, any node clustering algorithm can be used. However,
since optimisation of modularity is related to the behaviour of
random walkers on a graph and the construction of $L_1(G_1)$
preserves the dynamics of random walkers, it makes sense to apply
the modularity optimisation approach to find the partitions of the
line graph $L_1(G_1)$. We use the modularity maximization algorithm
proposed in [44].

The conversion of the graph from node to line is done as
follows: first the node graph is represented using the incidence
matrix $B$ of size $|V| \times |E|$, where $B_{i\alpha}$ is equal to 1 if edge $\alpha$ is incident on node $i$
and 0 otherwise. The matrix $B$ can be seen as an adjacency matrix
of a bipartite network. The line graph is constructed as a
projection of the bipartite graph, taking all nodes of one type
as the nodes of the projected graph. A link is added between two
nodes in the projected graph if they have at least one node of
the other type in common in the original bipartite graph, resulting
in the adjacency matrix $C$ of size $|E| \times |E|$ of the line graph $L_1(G_1)$, with
elements defined by:

$$ C_{\alpha\beta} = \sum_{i} B_{i\alpha} B_{i\beta} \left(1 - \delta_{\alpha\beta}\right) \tag{9} $$

where $\delta_{\alpha\beta}$ is the Kronecker delta symbol.

By calculating the adjacency matrix as in Eq. 9, nodes with high
degree (hubs) are given too much prominence in the line graph, so
normalization is used to avoid this effect and $C_{\alpha\beta}$ is calculated
with:

$$ C_{\alpha\beta} = \sum_{i,\, k_i > 1} \frac{B_{i\alpha} B_{i\beta}}{k_i - 1} \left(1 - \delta_{\alpha\beta}\right) \tag{10} $$

where $k_i$ is the degree of node $i$.
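
The node-to-line-graph conversion can be sketched without building the incidence matrix explicitly: two edges of the node graph are linked in the line graph whenever they share an endpoint, and in the normalized variant each shared endpoint of degree k contributes 1/(k − 1) instead of 1. The sketch below assumes a simple graph (no self-loops or duplicate edges); the function name `line_graph_weights` and the toy edge list are hypothetical.

```python
def line_graph_weights(edges, normalized=True):
    """Return {(a, b): weight} for pairs of edge indices a < b that share a node."""
    # incident[node] lists the indices of the edges touching that node.
    incident = {}
    for idx, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(idx)
        incident.setdefault(v, []).append(idx)
    C = {}
    for node, idxs in incident.items():
        k = len(idxs)              # degree of the node
        if k < 2:
            continue               # a degree-1 node links no pair of edges
        share = 1.0 / (k - 1) if normalized else 1.0
        for x in range(k):
            for y in range(x + 1, k):
                key = (idxs[x], idxs[y])
                C[key] = C.get(key, 0.0) + share
    return C

# Toy node graph: a hub (node 0) with three neighbours, plus the edge 1-2.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
C_plain = line_graph_weights(edges, normalized=False)
C_norm = line_graph_weights(edges, normalized=True)
print(C_plain)
print(C_norm)
```

In the toy example the three edges meeting at the hub are linked with weight 1 in the plain projection but only 0.5 in the normalized one, while links created by the degree-2 nodes keep weight 1; this is the damping of hubs described above.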

When we work with weighted node graphs, $G_2(V, E, W)$, a
second, weighted incidence matrix $\tilde{B}$ is introduced, where $\tilde{B}_{j\alpha} = w_{\alpha}$
if edge $\alpha$ is incident on vertex $j$ and has weight $w_{\alpha}$. Each node $i$ has
strength $s_i$, defined as the sum of the weights of its incident edges.
As in the unweighted case, the normalized adjacency matrix is
computed for the weighted line graph $L_2(G_2)$, given with:



(11) 



The visual representation of the node to line graph transformation
is shown in Figure 3.

Random Walks and Maps Algorithms. The ability of
random walks to generate dynamics and represent information
flow in the network makes them suitable for use in the clustering
problem. The probability flow of random walks on a graph is used
by Rosvall and Bergstrom to create an efficient and accurate
clustering method (Infomap) [52]. This algorithm additionally uses
Huffman coding to describe the path on the network, which also
allows compression of the maps and speeds up module
detection. This coding retains unique names for the
important structures formed during the random walks.
The random walk equation used for undirected graphs is as
follows:


$$ X(t+1) = (1 - r) A X(t) + r S \tag{12} $$

where in the case of unweighted graphs ($G_1$, $G_3$), $A$ is the
normalized adjacency matrix, while in the case of weighted graphs
($G_2$, $G_4$), $A$ is the weight matrix $W$, $r$ is the teleportation or restart
probability, $X(t)$ is the probability vector for the random walker
visiting a node at time $t$, and $S$ is the starting probability vector
(usually $S$ is all zeros except for the start node, whose value equals 1). At the beginning
$X(0) = S$.

Figure 3. The transformation from a node graph (on the left) to the corresponding line graph (on the right). Edges a, b, c,
d, and e from the node graph are mapped to nodes a, b, c, d, and e of the line graph, respectively.
doi:10.1371/journal.pone.0099755.g003
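
The random walk iteration X(t+1) = (1 − r) A X(t) + r S can be sketched directly. This is only the walk itself, not the Infomap algorithm, which additionally optimizes a description length over partitions. The sketch assumes that "normalized" means column-normalized, so that each column of A sums to 1; the function name, the restart value r = 0.15 and the fixed number of steps are illustrative choices, not values taken from the paper.

```python
def random_walk_with_restart(A, start, r=0.15, steps=100):
    """Iterate X(t+1) = (1 - r) * P * X(t) + r * S with column-normalized P."""
    n = len(A)
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize so that each column of P sums to 1 (assumed convention).
    P = [[A[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    S = [0.0] * n
    S[start] = 1.0          # all restart mass on the start node
    X = S[:]                # X(0) = S
    for _ in range(steps):
        X = [(1 - r) * sum(P[i][j] * X[j] for j in range(n)) + r * S[i]
             for i in range(n)]
    return X

# Path graph 0 - 1 - 2, walker restarting at node 0.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
X = random_walk_with_restart(A, start=0)
print(X)
```

On the three-node path the two end nodes are structurally identical, so the extra probability that node 0 receives relative to node 2 is entirely due to the restart term.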

Functional Annotation 

There are a few different methods in the literature for assigning
terms to a query protein after clusters are determined. Each of the
methods is based on calculating a score for each term associated
with a node that belongs to the same cluster as the query node,
and assigning to the query protein those terms whose score is
above or below a predefined threshold, depending on the
score type being used. In our work we tested the hypergeometric
enrichment P-value, the chi-square statistic and the term frequency
within the cluster as scores for predicting terms.

The hypergeometric enrichment P-value for term $t$ is
calculated with:

$$ P = \sum_{i = n_t}^{\min(C,\, T)} \frac{\binom{T}{i} \binom{N - T}{C - i}}{\binom{N}{C}} \tag{13} $$

where $N$ is the number of nodes in the graph representing the
PIN, $T$ is the number of nodes in the graph that have term $t$
assigned to them, $C$ is the cluster size and $n_t$ is the number of
nodes in the cluster that have term $t$ assigned to them. The terms
enriched within the cluster (i.e. obtaining a P-value below some
threshold) are then predicted for the query node.
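
A hedged sketch of the enrichment score follows. It assumes the usual upper-tail convention, i.e. the probability of observing at least n_t annotated nodes in a cluster of size C drawn without replacement from N nodes of which T carry the term; the function name and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(N, T, C, n_t):
    """P(X >= n_t) for X ~ Hypergeometric(N nodes, T annotated, C drawn)."""
    upper = min(C, T)
    tail = sum(comb(T, i) * comb(N - T, C - i) for i in range(n_t, upper + 1))
    return tail / comb(N, C)

# Toy numbers: 10 nodes, 4 carry the term, cluster of 3 with 2 annotated members.
p = hypergeom_enrichment_pvalue(N=10, T=4, C=3, n_t=2)
print(p)
```

With these toy numbers the tail contains 40 of the 120 possible clusters, so P = 1/3; a term would only be predicted if this value fell below the chosen threshold.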

The chi-square statistic score for term $t$ is defined with:

$$ \chi_t^{2} = \frac{(n_t - e_t)^{2}}{e_t} \tag{14} $$

where $n_t$ has the same meaning as in the previous score and $e_t$ is
the expected number of nodes in the cluster that have term $t$
assigned to them. The expected number is calculated using the simple
proportion $e_t = (T/N)\,C$, with $T$, $N$, and $C$ having the same
meaning as in the previous score.
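
The chi-square score is a one-line computation once the expected count is known. The sketch below assumes the form (n_t − e_t)² / e_t with e_t = (T/N) C; the function name and the example numbers are hypothetical.

```python
def chi_square_score(N, T, C, n_t):
    """Chi-square score of a term: squared deviation from the expected count."""
    e_t = (T / N) * C          # expected number of annotated nodes in the cluster
    return (n_t - e_t) ** 2 / e_t

# Same toy numbers as a cluster of 3 nodes, 2 annotated, in a graph of 10 with 4 annotated.
score = chi_square_score(N=10, T=4, C=3, n_t=2)
print(score)
```

Here the expected count is 1.2, so the score is 0.64 / 1.2, roughly 0.533. Because the deviation is squared, the score by itself does not distinguish over-representation from under-representation of a term.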

The simplest and most intuitive score calculation approach 
would be that each term is ranked by its frequency of appearance 
as a term assigned to nodes within the cluster. This approach is 
derived from the well known Majority Algorithm used in [53], 
where a node is assigned with the most frequent terms occurring in 
its neighbours. Our definition expands the node neighbourhood 
not only to the direct neighbours but to all nodes that are in the 
cluster it belongs to, K: 



(15) 



where $T_K$ is the set of terms present in the cluster $K$, and

$$ z_{ij} = \begin{cases} 1, & \text{if the } i\text{-th node from } K \text{ is assigned the } j\text{-th term from } T_K \\ 0, & \text{otherwise} \end{cases} \tag{16} $$
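
The frequency-based score can be sketched as a simple count over the cluster members. The sketch assumes that the score of a term is the number of nodes in the cluster annotated with it (the sum of the indicator z over the nodes of K); excluding the query protein from the count and returning only the top-ranked terms are additional assumptions made here for illustration, and all names and data are hypothetical.

```python
from collections import Counter

def frequency_scores(cluster, annotations, query=None):
    """Count, for each term, how many cluster members (other than `query`) carry it."""
    counts = Counter()
    for node in cluster:
        if node == query:
            continue           # assumption: the query does not vote for itself
        counts.update(annotations.get(node, ()))
    return counts

def predict_terms(cluster, annotations, query, top=1):
    """Return the `top` most frequent terms in the query's cluster."""
    scores = frequency_scores(cluster, annotations, query)
    return [term for term, _ in scores.most_common(top)]

# Toy cluster with one unannotated query protein "Q".
cluster = ["P1", "P2", "P3", "Q"]
annotations = {"P1": {"a", "b"}, "P2": {"a"}, "P3": {"a", "c"}}
scores = frequency_scores(cluster, annotations, query="Q")
predicted = predict_terms(cluster, annotations, query="Q", top=1)
print(scores, predicted)
```

In this toy cluster term "a" is carried by all three annotated members, so it is the single top-ranked prediction for the query.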



We need to note here that when we work with graph
representation $G_3$, i.e. the protein-term graph, the definitions of
some quantities used in the score calculations need to be altered.
Namely, we say that a term $t$ is present in a cluster if the
corresponding term node $t$ belongs to the cluster. The total
number of nodes in the graph corresponds to the total number of
protein nodes, the size of the cluster corresponds to the number of
protein nodes in the cluster, the number of nodes in the graph with
term $t$ assigned to them corresponds to the degree of term node
$t$, and the number of nodes in a cluster with term $t$ assigned to
them corresponds to the number of edges between term node $t$
and protein nodes belonging to the cluster. For the frequency score
$T_K$ is now a set of term nodes and $z_{ij}$ is defined with:

$$ z_{ij} = \begin{cases} 1, & \text{if the } i\text{-th protein node from } K \text{ has an edge to the } j\text{-th term from } T_K \\ 0, & \text{otherwise} \end{cases} \tag{17} $$

Our experiments showed that the frequency-based score for
function prediction outperforms the other two scores for any
combination of graph representation and clustering algorithm, so
for simplicity all the results presented are based on this approach.

Table 1. Summary table for the size of the different proposed graph representations of our PIN.

Graph representation | Number of nodes | Number of edges
Simple | 2502 | 6354
Weighted | 2502 | 6354
Protein-Term | 3390 | 37869
Full Functional Connected | 2502 | 1086948

doi:10.1371/journal.pone.0099755.t001

Table 2. Summary for the different clustering algorithms used in this paper showing their computational approach and
complexity, where v is the number of nodes in the graph being clustered, and e is the corresponding number of edges.

Clustering Algorithm | Computational Approach | Complexity
FC | modularity maximization using max heaps | O(v log^2 v)
BGLL | modularity maximization using multi-passes and supervertices | O(e)
TimeBGLL | modularity maximization with resolution parameter corresponding to the time parameter of a random walk on the graph | O(e)
EdgeCluster | modularity maximization on the line graph | O(e)
Infomap | minimal description length of a random walker using Huffman coding for each node | O(e)

doi:10.1371/journal.pone.0099755.t002

Results and Discussion 

We tested representative algorithms of the previously described
classes of clustering algorithms, including FC [43], BGLL [44],
TimeBGLL [48], EdgeCluster [50,51], and Infomap [52]. We
first evaluated the clustering validity of the different
algorithms. Each of these algorithms was then used to determine
clusters in each of the different graph representations of our
Saccharomyces cerevisiae PIN. We evaluated the clustering results
in terms of functional validity and also in terms of accuracy when
used in function prediction.

Before we proceed to the results and the discussion of the main
focus of this paper, i.e. function prediction via clustering
methods, we give a summary of the computational complexity of
our experiments. Although computational resources are abundant
nowadays, complexity should not be ignored when deciding upon an
experimental setup. Table 1 summarizes the sizes of the proposed
graph representations of our PIN, which are crucial for the expected
runtime, i.e. the computational complexity of the clustering algorithms,
which is given in Table 2. As can be seen, BGLL, TimeBGLL,
EdgeCluster and Infomap have essentially linear runtime, proportional
to the number of edges within the graph, while FC runs in
quasilinear time proportional to the number of nodes within the
graph, but nevertheless runs faster than any polynomial with
exponent strictly greater than 1.

Table 3. NMI values expressing the quality of clustering of a
synthetic graph modeled with the parameters of our PIN
achieved by employing the clustering algorithms used in this
paper (Infomap, TimeBGLL, EdgeCluster, BGLL, FC) and the
algorithms previously used in clustering of PINs (MCL, RNSC,
MCODE, SPC), as cited in the introduction.

| Clustering Algorithm | NMI |
|---|---|
| Infomap | 0.9916 |
| TimeBGLL | 0.9062 |
| EdgeCluster | 0.8732 |
| BGLL | 0.8514 |
| FC | 0.8230 |
| MCL | 0.4979 |
| RNSC | 0.4562 |
| MCODE | 0.2360 |
| SPC | 0.2147 |

doi:10.1371/journal.pone.0099755.t003
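To make the size differences concrete, the following rough Python sketch combines the graph sizes from Table 1 with the asymptotic expressions from Table 2. It ignores constant factors, assumes a base-2 logarithm for the FC bound, and uses hypothetical helper names, so the numbers indicate relative growth only and are not runtime measurements.

```python
import math

# (number of nodes v, number of edges e) as reported in Table 1.
GRAPH_SIZES = {
    "Simple": (2502, 6354),
    "Weighted": (2502, 6354),
    "Protein-Term": (3390, 37869),
    "Full Functional Connected": (2502, 1086948),
}


def rough_cost(v: int, e: int) -> dict:
    """Evaluate the asymptotic expressions of Table 2 without constants."""
    return {
        "FC: v log^2 v": v * math.log2(v) ** 2,   # base-2 log is an assumption
        "edge-linear algorithms: e": float(e),
    }


def edge_growth(sizes: dict, baseline: str = "Simple") -> dict:
    """Ratio of each representation's edge count to the baseline's."""
    base_edges = sizes[baseline][1]
    return {name: e / base_edges for name, (_, e) in sizes.items()}


for name, (v, e) in GRAPH_SIZES.items():
    print(name, rough_cost(v, e))
print(edge_growth(GRAPH_SIZES))
```

Under these assumptions the Full Functional Connected graph has roughly 171 times as many edges as the simple graph, which is the factor by which the work of the edge-linear algorithms grows.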

Clustering Validation 

Clustering validation was performed using a synthetic benchmark
graph as given in [54] in order to compare the different
clustering methods used in our work. The synthetic graph was
modeled with the parameters of the simple graph representation of
our PIN. Since the aim of this experiment is to determine the
clustering power of our chosen algorithms and compare them
among themselves and with other algorithms used in previous
research, the graph representation is of no significance and any one
can be used. The resulting clusters were compared with the a
priori known clusters using the Normalized Mutual Information
(NMI) method proposed in [55]. It is based on defining a
confusion matrix $M$, where the rows correspond to the "real"
clusters, and the columns correspond to the "found" clusters. The
element $M_{ij}$ of $M$ is the number of nodes in the real cluster $i$ that
appear in the found cluster $j$. A measure of similarity between the
clusters, based on information theory, is then:



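$$
\mathrm{NMI}(A,B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} M_{ij} \log\left(\frac{M_{ij} M}{M_{i\cdot} M_{\cdot j}}\right)}{\sum_{i=1}^{C_A} M_{i\cdot} \log\left(\frac{M_{i\cdot}}{M}\right) + \sum_{j=1}^{C_B} M_{\cdot j} \log\left(\frac{M_{\cdot j}}{M}\right)}
\qquad (18)
$$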



where the number of real clusters is denoted $C_A$ and the number
of found clusters is denoted $C_B$, the sum over row $i$ of matrix $M$ is
denoted $M_{i\cdot}$, the sum over column $j$ is denoted $M_{\cdot j}$, and the total
number of nodes is $M$. The normalized mutual information equals
1 if the clusters are identical and 0 if they are totally independent.
The definition of the measure when the clusters are overlapping
(EdgeCluster) is given in detail in the appendix of [56].
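As an illustration of the NMI measure for non-overlapping clusters, here is a minimal Python sketch that evaluates it directly from a confusion matrix given as a list of rows. The function name `nmi` is hypothetical, the natural logarithm is used (the base cancels in the ratio), and returning 1.0 when both partitions consist of a single cluster is a convention assumed here, not something stated in the text.

```python
import math
from typing import List


def nmi(confusion: List[List[int]]) -> float:
    """Normalized mutual information from a confusion matrix.

    confusion[i][j] = number of nodes of real cluster i found in cluster j.
    """
    row_sums = [sum(row) for row in confusion]          # M_i.
    col_sums = [sum(col) for col in zip(*confusion)]    # M_.j
    total = sum(row_sums)                               # M

    numerator = 0.0
    for i, row in enumerate(confusion):
        for j, m_ij in enumerate(row):
            if m_ij > 0:  # 0 * log(0) is treated as 0
                numerator += m_ij * math.log(
                    m_ij * total / (row_sums[i] * col_sums[j])
                )

    denominator = sum(s * math.log(s / total) for s in row_sums if s > 0)
    denominator += sum(s * math.log(s / total) for s in col_sums if s > 0)
    if denominator == 0:
        # Both partitions are a single cluster (assumed convention).
        return 1.0
    return -2.0 * numerator / denominator


print(nmi([[3, 0], [0, 3]]))   # identical partitions
print(nmi([[2, 2], [2, 2]]))   # independent partitions
```

The two calls reproduce the limiting cases named in the text: identical partitions give 1, and statistically independent partitions give 0.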

Table 3 shows the resulting values for the NMI score calculated
as previously explained. These results justify the selection of the
representative clustering algorithms in this paper, as they
outperform the algorithms previously used in clustering of PINs
based on the topological features of the network, as cited in the
introduction, i.e. MCL, RNSC, SPC, and MCODE. Later experiments
show that the performance "ranking" on function
prediction more or less follows the one given in Table 3.

Biological Validity of the Clusters 

Table 4. The entropy values as defined with Eq. 20 for each combination of the PIN graph representation and a clustering
algorithm.

| Clustering Alg. \ Representation | Simple | Weighted Content | Weighted Structure | Weighted Hybrid | Protein-Term | FFC |
|---|---|---|---|---|---|---|
| Infomap | 0.2528 | 0.3034 | 0.3018 | 0.3002 | 0.3156 | 0.5361 |
| TimeBGLL | 0.3064 | 0.3381 | 0.3271 | 0.3213 | 0.5832 | 0.5783 |
| EdgeCluster | 0.2953 | 0.3294 | 0.3216 | 0.3172 | 0.5716 | 0.6713 |
| BGLL | 0.2707 | 0.3113 | 0.3027 | 0.2993 | 0.5613 | 0.6472 |
| FC | 0.2807 | 0.3121 | 0.3042 | 0.3001 | 0.5589 | 0.6452 |

Lower values yield smaller coverage of the average cluster, i.e. fewer mistakes during the term assignment process, but on the downside the necessary terms for
complete annotation of a query protein may be lacking. In terms of the definitions used for the annotation validation this would mean that lower entropy values yield
fewer False Positives (FPs), but more False Negatives (FNs). The inverse holds for higher entropy values.
doi:10.1371/journal.pone.0099755.t004

We use several clustering algorithms that produce clusters
differing in size and structure, and we evaluate their
biological relevance; in other words, we test whether the
cluster structure has arisen by chance. If a cluster is
biologically relevant, the genes belonging to the same cluster
should have similar biological functions [8]. Therefore the
functional homogeneity of a cluster is an indicator of its biological
validity. Most of the methods for calculating a cluster's functional
homogeneity include some form of the P-value measure. In [21]
a modified P-value, which combines computationally derived
clusters with "real" complexes derived from the protein databases,
is used:



$$
P(\mathrm{overlap}) = \frac{\binom{n_2}{k} \binom{N - n_2}{n_1 - k}}{\binom{N}{n_1}}
\qquad (19)
$$



where $N$ is the total number of nodes in the network, $n_1$ and $n_2$ are
the sizes of the two complexes (the derived and the real one), and $k$
is the number of nodes they have in common. This measure is
effective when evaluating a single clustering algorithm,
but for two or more algorithms the evaluation is time consuming
as it requires extraction of the corresponding real complexes for
each computed cluster.
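The overlap probability of Eq. (19) is a hypergeometric term and can be evaluated with exact integer binomials. The sketch below is only an illustration, with a hypothetical function name and toy numbers that are not taken from the paper's data.

```python
from math import comb


def overlap_probability(N: int, n1: int, n2: int, k: int) -> float:
    """Probability that two complexes of sizes n1 and n2, drawn from a
    network of N nodes, share exactly k nodes (hypergeometric term)."""
    if k < 0 or k > min(n1, n2):
        return 0.0
    # comb() returns 0 when the lower index exceeds the upper one, which
    # covers the case n1 - k > N - n2.
    return comb(n2, k) * comb(N - n2, n1 - k) / comb(N, n1)


# Toy example: network of 10 nodes, complexes of sizes 3 and 4, overlap 2.
print(overlap_probability(10, 3, 4, 2))
```

Summing the value over all feasible overlaps `k` gives 1, which is a convenient sanity check when the function is applied to real cluster sizes.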

A more efficient way of testing functional homogeneity is
through functional entropy. The entropy is calculated as the negative
sum, over all function terms in the cluster, of the appearance
frequency of each term multiplied by the logarithm of that frequency [57]:



$$
H = -\sum_{i=1}^{n} F_i \log F_i, \qquad F_i = \frac{T_i}{\sum_{j=1}^{n} T_j}
\qquad (20)
$$



where $F_i$ is the appearance frequency of term $i$, as given in the
equation above, $T_i$ is the number of times that term appears in the
cluster, and $n$ is the number of distinct terms present in the cluster.
If the nodes in the same cluster have consistent terms, the value of
the functional entropy will be low, being zero when the nodes have
only one term. We performed the biological validation of our
clustering algorithms using entropy. We retained only clusters with
more than 2 nodes, and for each combination of graph
representation and clustering algorithm we calculated the average
entropy over all clusters.
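A minimal Python sketch of this validation step follows. It assumes that the appearance frequency of a term is its count divided by the total number of term occurrences in the cluster, and that the natural logarithm is used; neither detail is fixed by the text, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def functional_entropy(cluster: Sequence[Sequence[str]]) -> float:
    """Entropy of the term distribution in one cluster.

    cluster: one list of annotation terms per protein node.
    """
    counts = Counter(term for terms in cluster for term in terms)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no annotations at all (assumed convention)
    entropy = 0.0
    for count in counts.values():
        freq = count / total          # assumed normalisation of F_i
        entropy -= freq * math.log(freq)
    return entropy


def average_entropy(clusters: List[Sequence[Sequence[str]]]) -> float:
    """Average entropy over clusters with more than 2 nodes."""
    kept = [c for c in clusters if len(c) > 2]
    if not kept:
        raise ValueError("no cluster has more than 2 nodes")
    return sum(functional_entropy(c) for c in kept) / len(kept)


homogeneous = [["t1"], ["t1"], ["t1"]]
mixed = [["t1"], ["t2"], ["t1", "t2"]]
print(functional_entropy(homogeneous), functional_entropy(mixed))
print(average_entropy([homogeneous, mixed, [["t3"], ["t4"]]]))
```

The homogeneous toy cluster has zero entropy, in line with the remark that entropy vanishes when the nodes carry only one term, and the two-node cluster is excluded from the average.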

The calculated entropy values are shown in Table 4. Given
the definition of the entropy measure, lower values
indicate an algorithm that is more stringent at identifying
functionally coherent clusters. A second and more interesting
aspect of the entropy in relation to our research is the correlation
between the entropy values and the results of the functional annotation of
proteins using the clustering algorithms. Namely, the lower the
entropy of an algorithm, the smaller the coverage of the average
cluster. The coverage of a cluster is defined here as the ratio
between the number of terms present in the cluster and the
number of terms present in the whole network. Lower
coverage clusters lead to fewer mistakes being made during the
term assignment process, but on the downside these clusters may
lack the terms needed for correct and complete
annotation of a query protein. In terms of the definitions used
for the annotation validation this means that lower entropy
values yield fewer False Positives (FPs), but more False Negatives
(FNs). The inverse holds for higher entropy values.

Figure 4. Results for the functional annotation for each graph representation using Infomap, showing the ROC curve (A) with H
indicating the corresponding entropy value and the corresponding AUC values (B).
doi:10.1371/journal.pone.0099755.g004

Figure 5. Results for the functional annotation for each graph representation using timeBGLL, showing the ROC curve (A) with H
indicating the corresponding entropy value and the corresponding AUC values (B).
doi:10.1371/journal.pone.0099755.g005
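The coverage of a cluster, as defined in this section, is a simple ratio of distinct term counts. A small Python sketch with a hypothetical function name and toy term sets:

```python
from typing import Iterable


def cluster_coverage(cluster_terms: Iterable[str],
                     network_terms: Iterable[str]) -> float:
    """Number of distinct terms present in the cluster divided by the
    number of distinct terms present in the whole network."""
    network = set(network_terms)
    if not network:
        raise ValueError("the network has no terms")
    return len(set(cluster_terms) & network) / len(network)


# A cluster carrying 2 of the 4 terms of the network has coverage 0.5.
print(cluster_coverage({"t1", "t2"}, {"t1", "t2", "t3", "t4"}))
```

A cluster with low coverage can only propose few terms for a query protein, which is why low coverage goes together with fewer false positives and more false negatives.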

Annotation Validation 

The effective evaluation of protein functional annotation is
challenging. The lack of agreed measures and benchmarks for
assessing the methods' performance makes this task
difficult. In our work we used the leave-one-out method, in which only
one protein at a time plays the role of a query protein. In the
leave-one-out method a random annotated protein is selected and is
considered as unannotated. This assumption that no terms are present
at the query protein affects different representations in different
ways. For the unweighted representations no additional changes
have to be made, while weighted graphs should be altered since
the weight computation is no longer possible as defined by the
corresponding equations. Specifically, if the representation uses the
content based weight, its value is substituted with the structure
based weight and everything else remains the same. For the
Protein-Term representation (G3) the unannotated query protein
assumption means that all edges to term nodes should be deleted.
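For the Protein-Term representation, the leave-one-out step amounts to removing the query protein's edges to term nodes while remembering its true terms for the later evaluation. A minimal sketch, assuming the same hypothetical term-to-proteins mapping as a data layout:

```python
from typing import Dict, Set, Tuple


def mask_query_terms(
    term_edges: Dict[str, Set[str]], query: str
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Return a copy of the protein-term edges in which the query protein
    has no edges to term nodes, plus the set of terms that were hidden."""
    true_terms = {t for t, prots in term_edges.items() if query in prots}
    masked = {t: set(prots) - {query} for t, prots in term_edges.items()}
    return masked, true_terms


edges = {"t1": {"p1", "p2"}, "t2": {"p2", "p3"}, "t3": {"p3"}}
masked_edges, hidden = mask_query_terms(edges, "p2")
print(masked_edges, hidden)
```

The input mapping is left unmodified, so the same graph can be reused for the next query protein.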



Once the clustering algorithm has been applied, for each term
present in the query cluster (i.e. the cluster of the query protein) we
calculate its rank according to Eq. (15), and all ranks are then
normalized to a range between 0 and 1. We should also note here
that when the unannotated query protein assumption causes
changes within the graph representation, the clustering algorithm
should be run for each query protein. The query protein is
annotated with all functions that have a rank above a previously
determined threshold ω. For example, for ω = 0, the query protein
is assigned all the functions present in its cluster. We vary
the threshold in the [0, 1] range and compute the counts for the
four possible classes which can occur during the
assignment process:

• True Positive (TP): When annotation is assigned and is part of 
the true annotation set 

• True Negative (TN): When annotation is not assigned to the 
protein and is not part of the true annotation set 

• False Positive (FP): When annotation is assigned but is not part 
of the true annotation set 

• False Negative (FN): When annotation is not assigned but is 
part of the true annotation set 
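The four classes above can be counted for one query protein with a few set operations. The sketch below assumes that ranks are positive and normalized by dividing by the largest rank in the query cluster, and that a term is assigned when its normalized rank is strictly above the threshold; the text does not specify either detail, and all names are hypothetical.

```python
from typing import Dict, Set, Tuple


def normalize_ranks(ranks: Dict[str, float]) -> Dict[str, float]:
    """Scale positive ranks into (0, 1] by dividing by the maximum (assumption)."""
    top = max(ranks.values())
    return {term: rank / top for term, rank in ranks.items()}


def confusion_counts(
    normalized: Dict[str, float],
    true_terms: Set[str],
    all_terms: Set[str],
    omega: float,
) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for one query protein at threshold omega."""
    assigned = {t for t, r in normalized.items() if r > omega}
    tp = len(assigned & true_terms)
    fp = len(assigned - true_terms)
    fn = len(true_terms - assigned)
    tn = len(all_terms - assigned - true_terms)
    return tp, fp, tn, fn


ranks = normalize_ranks({"a": 4.0, "b": 2.0, "c": 1.0})
truth = {"a", "d"}
universe = {"a", "b", "c", "d", "e"}
print(confusion_counts(ranks, truth, universe, omega=0.0))
print(confusion_counts(ranks, truth, universe, omega=0.5))
```

With ω = 0 every term of the query cluster is assigned, so the false positives are at their maximum for that cluster; raising ω trades false positives for false negatives.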




Figure 6. Results for the functional annotation for each graph representation using edgeCluster, showing the ROC curve (A) with H
indicating the corresponding entropy value and the corresponding AUC values (B).
doi:10.1371/journal.pone.0099755.g006

Figure 7. Results for the functional annotation for each graph representation using BGLL, showing the ROC curve (A) with H
indicating the corresponding entropy value and the corresponding AUC values (B).
doi:10.1371/journal.pone.0099755.g007



Each annotation is assigned to one of the four classes. Using the 
number of annotations in each class (given in brackets above) we 
can calculate the following statistical measures: 



$$
\mathrm{Sensitivity\ (TruePositiveRate)} = \frac{TP}{TP + FN}
\qquad (21)
$$

$$
\mathrm{FalsePositiveRate} = \frac{FP}{FP + TN}
\qquad (22)
$$



Graphed as coordinate pairs, the Sensitivity and the FalsePositiveRate
form the Receiver Operating Characteristic curve (or
ROC curve). The ROC curve describes the performance of a
model across the entire range of classification thresholds. The Area
Under Curve (AUC) of a classifier is equivalent to the probability
that the classifier will rank a randomly chosen positive instance
higher than a randomly chosen negative instance [58].
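A compact Python sketch of this evaluation step is given below. It uses the standard definitions, sensitivity TP / (TP + FN) and false positive rate FP / (FP + TN), and approximates the AUC with the trapezoidal rule over the measured ROC points. Adding the end points (0, 0) and (1, 1) is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (sensitivity, false positive rate) from the four class counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, fpr


def auc_trapezoid(points: Iterable[Tuple[float, float]]) -> float:
    """Area under the ROC curve through (fpr, sensitivity) points."""
    # Close the curve with the trivial classifiers (assumption of the sketch).
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


print(rates(tp=1, fp=2, tn=1, fn=1))
print(auc_trapezoid([(0.0, 1.0)]))   # perfect classifier
print(auc_trapezoid([(0.5, 0.5)]))   # point on the diagonal
```

Each threshold ω yields one (fpr, sensitivity) point; collecting the points for all thresholds and passing them to `auc_trapezoid` gives one AUC value per combination of algorithm and representation.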

We performed functional annotation for each combination of a
clustering algorithm and a graph representation of our
Saccharomyces cerevisiae PIN. Figures 4-8 show the ROC curves and
the AUC values for each graph representation for Infomap,
timeBGLL, edgeCluster, BGLL and FC, respectively. Tables 5-9
show the sensitivity and false positive rate at threshold values from
ω = 0 to ω = 0.9 with a step of 0.1.

The results shown in Figures 4-8 and Tables 5-9 confirm what we
previously stated about the influence of the entropy
value. As expected, the more complex representations (G3 or
ProteinTerm and G4 or FFC graph) have higher entropy values,
which implicitly increases the Sensitivity and fpr values (by
increasing the FP and decreasing the FN). The opposite holds for the
simpler representations (G1 or Simple and G2 or Weighted graph).

If we average the AUC values for a single algorithm over all
graph representations (Table 10), the top ranking algorithm is
edge clustering with AvgAUC_EdgeCluster = 0.9065, followed by
AvgAUC_Infomap = 0.8963, AvgAUC_TimeBGLL = 0.8913,
AvgAUC_BGLL = 0.8864, and AvgAUC_FC = 0.8831. This result
is in line with the well known fact that protein interaction networks
have many multifunctional proteins that perform several functions,
and are expected to interact specifically with distinct sets of
partners, simultaneously or not, depending on the function
performed. A closer look at Tables 5-9 gives a
better perspective on the quality of the annotation
process based on each of the clustering algorithms.
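The per-algorithm averages can be recomputed from the individual AUC values reported in Table 10. The following Python sketch does this and derives the ranking; the dictionary layout and function names are hypothetical, while the numbers are those of Table 10.

```python
from typing import Dict, List

# AUC values from Table 10: representation -> algorithm -> AUC.
AUC: Dict[str, Dict[str, float]] = {
    "Simple": {"FC": 0.8432, "BGLL": 0.8446, "TimeBGLL": 0.8653,
               "EdgeCluster": 0.8789, "Infomap": 0.8589},
    "Weighted Content": {"FC": 0.8733, "BGLL": 0.8740, "TimeBGLL": 0.8757,
                         "EdgeCluster": 0.8902, "Infomap": 0.8730},
    "Weighted Structure": {"FC": 0.8835, "BGLL": 0.8868, "TimeBGLL": 0.8913,
                           "EdgeCluster": 0.9046, "Infomap": 0.8886},
    "Weighted Hybrid": {"FC": 0.8882, "BGLL": 0.8917, "TimeBGLL": 0.8975,
                        "EdgeCluster": 0.9107, "Infomap": 0.8928},
    "Protein-Term": {"FC": 0.9267, "BGLL": 0.9341, "TimeBGLL": 0.9210,
                     "EdgeCluster": 0.9420, "Infomap": 0.9660},
    "FFC": {"FC": 0.8839, "BGLL": 0.8875, "TimeBGLL": 0.8979,
            "EdgeCluster": 0.9129, "Infomap": 0.8986},
}


def average_per_algorithm(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average the AUC of each algorithm over all graph representations."""
    algorithms = next(iter(table.values())).keys()
    return {a: sum(row[a] for row in table.values()) / len(table)
            for a in algorithms}


def ranking(averages: Dict[str, float]) -> List[str]:
    """Algorithms sorted from highest to lowest average AUC."""
    return sorted(averages, key=averages.get, reverse=True)


averages = average_per_algorithm(AUC)
print({name: round(value, 4) for name, value in averages.items()})
print(ranking(averages))
```

The recomputed averages agree with the reported ones to within about 2e-4, a difference that is presumably due to rounding of the tabulated values, and they reproduce the ranking stated in the text.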

Table 11 shows the corresponding sensitivity and false positive
rate values for each of the algorithms combined with each of the
representations at a fixed threshold ω = 0. These values are a
general indicator of the behaviour of the corresponding annotation
process. The EdgeCluster algorithm shows a much greater false
positive rate than the next in line (according to
AvgAUC), Infomap. In fact, Infomap has the overall lowest
false positive rates across the graph representations. This means
that Infomap performs very stringent clustering of the PIN, which
results in clusters that are poor in terms of function (term) diversity,
therefore missing out on part of the functions (terms) which should
be associated with a query protein. This leads to a very precise, but
incomplete view of the annotation set of the query protein. On the
other hand, EdgeCluster, timeBGLL, BGLL, and FC achieve
much higher sensitivity at the price of a high false positive rate,
which means that the annotation set view is much richer but
noisier compared to Infomap. All of these results are due to the
fact that the ratio between the number of clusters generated with
Infomap and the other algorithms (all have similar numbers of
clusters) is approximately 2.5:1.

Figure 8. Results for the functional annotation for each graph representation using FC, showing the ROC curve (A) with H
indicating the corresponding entropy value and the corresponding AUC values (B).
doi:10.1371/journal.pone.0099755.g008

The performance of the algorithms on the different graph
representations proposed in this research is consistent in all the
experiments, as can be seen in Table 10. As expected, the simple
graph representation (G1) has the lowest AUC values for all
clustering approaches. The hybrid weighting scheme (G2) outperforms
each of the separate content and structure weightings, with
structure being more informative than content. The rise in
performance noted when using the FFC graph representation (G4)
suggests that the actual PIN is lacking part of the real interactions
that occur between pairs of proteins. Finally, the Protein-Term
representation (G3) yields the best results in terms of AUC, but
both G3 and G4 have the noisy annotation problem as stated
before (even for the usually low noise Infomap algorithm). In terms
of complexity, it is clear from Tables 1 and 2 that the G3 and G4
representations are more complex, and this computational
complexity should be taken into account when deciding on the
appropriate representation for a PIN. Also, a network wide
annotation would be very impractical if we use G2, G3, or G4, since
the clustering algorithm needs to be run for every query protein.
On the other hand, a scenario in which a wider set of possible
annotations needs to be determined for a single (or a few) protein(s)
would greatly benefit from these augmented PIN graph
representations.

In summary, and considering the goals defined, our results show
that all of the proposed novel representations yield a significant
improvement in the function prediction performance over the
simple unweighted graph representation. The Protein-Term graph
representation is the most informative one, and if computational
resources are not scarce it is the representation that should be used
for the prediction. The next in line is the FFC graph
representation, followed by the hybrid weighted graph
representation. The ease of further augmentation of these two
representations (for example with similarity metrics based on GO instead of
using a simple Jaccard index) is their added value, and they can be
further improved to maximize the annotation prediction
performance. All of the clustering algorithms used in this paper perform
very well on the PIN, as was shown in the clustering validation
section, with Infomap being the best in that context. In terms of
using these clustering algorithms in function prediction, the
most accurate one is the Infomap algorithm, while edgeCluster
and timeBGLL have the highest coverage.

As a final note, we point out another potential problem in the
process of function prediction using clustering, namely
completeness. It has been estimated that the complete S.
cerevisiae network has between 37800 and 75500 protein
interactions [59]. Currently there are between 55000 and 60000
interactions contained in publicly available repositories for S.
cerevisiae, which means there are potentially unknown regions of
the network, which can explain the high false positive rates and low
sensitivity stated before.

Table 10. Values for the AUC for the functional annotation with each clustering algorithm and graph representations for the PIN
and the average AUC values per algorithm and per representation.

| Representation \ Clustering Alg. | FC | BGLL | TimeBGLL | EdgeCluster | Infomap | AvgAUC |
|---|---|---|---|---|---|---|
| Simple | 0.8432 | 0.8446 | 0.8653 | 0.8789 | 0.8589 | 0.8581 |
| Weighted Content | 0.8733 | 0.8740 | 0.8757 | 0.8902 | 0.8730 | 0.8771 |
| Weighted Structure | 0.8835 | 0.8868 | 0.8913 | 0.9046 | 0.8886 | 0.8909 |
| Weighted Hybrid | 0.8882 | 0.8917 | 0.8975 | 0.9107 | 0.8928 | 0.8961 |
| Protein-Term | 0.9267 | 0.9341 | 0.9210 | 0.9420 | 0.9660 | 0.9379 |
| FFC | 0.8839 | 0.8875 | 0.8979 | 0.9129 | 0.8986 | 0.8962 |
| AvgAUC | 0.8831 | 0.8864 | 0.8913 | 0.9065 | 0.8963 | |

Conclusions 

Complex protein interaction networks reveal graph properties
that can be analysed in terms of functional modules associated
with the biological function they perform. In our work we
investigated the power of novel algorithms for complex network
clustering combined with novel graph representations of
protein interaction networks, and assessed their potential for
protein function prediction via clustering. We show that using
these algorithms we can gain significant knowledge about the
modular structure of the network. As these networks carry not
only interaction information but also annotations, the different
representations we propose augment the prediction process by
including this information in the clustering of the network.

The results from our experiments validate the augmented graph
representation approach. Even the simplest augmentation, i.e. the
different weighted graph representations of the PIN, significantly
improves the results of the function prediction. Our experiments
were performed using the simple normalized Jaccard Index as a
weighting factor, and we are confident that the results can be
further improved using a more sophisticated weighting scheme.
We used the same weighting when we further augmented the
graph representation by adding artificial edges to take into account
the well known fact that protein interaction networks to this date
are still not completely captured by the experimental methods
used for their construction. This representation is very complex
and computationally demanding, but the potential for uncovering
new knowledge is significantly increased. Our experiments showed
that the most informative representation is the one where we
generate a graph in which every single term associated with a
protein becomes a node and the association of proteins and terms
is represented by adding an edge between each pair. The power of
this representation to unravel the functions of a query protein
is the greatest of all proposed representations, but the same
holds for its computational complexity.

In general, if one would like to perform a network wide
annotation, usage of the weighted graph representations would be
recommended, while the exploration of a single protein, or a small
group of proteins, should be performed using either the full
functional connected graph or the protein-term graph. In terms
of selecting a clustering algorithm, our results showed that Infomap
has the best performance in determining the modular structure of
a PIN and is also the most accurate of all tested algorithms.
However, the high accuracy comes at the price of low coverage
(i.e. the inability to discover a larger set of functions associated
with a query protein). The opposite holds for the timeBGLL and
EdgeCluster algorithms. Depending on the required results, one
can choose either a random walk and map algorithm (Infomap) if
the priority is to get a narrow set of accurate protein functions, or
an edge clustering/overlapping clusters algorithm (EdgeCluster)
or a multi-resolution algorithm (timeBGLL) if coverage of
the possible functions is more important.

Table 11. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph
representation using each of the clustering algorithms, at a fixed threshold value (ω = 0).

| Representation \ Clustering Alg. | Measure | FC | BGLL | TimeBGLL | EdgeCluster | Infomap |
|---|---|---|---|---|---|---|
| Simple | sens. | 0.8343 | 0.7166 | 0.8185 | 0.8184 | 0.7523 |
| Simple | fpr | 0.2995 | 0.0447 | 0.1724 | 0.1403 | 0.0614 |
| Weighted Content | sens. | 0.8393 | 0.8381 | 0.8220 | 0.8992 | 0.7757 |
| Weighted Content | fpr | 0.1972 | 0.1942 | 0.1525 | 0.2740 | 0.0547 |
| Weighted Structure | sens. | 0.8511 | 0.8524 | 0.8426 | 0.9171 | 0.8034 |
| Weighted Structure | fpr | 0.1864 | 0.1833 | 0.1376 | 0.2613 | 0.0536 |
| Weighted Hybrid | sens. | 0.8605 | 0.8615 | 0.8539 | 0.9232 | 0.8104 |
| Weighted Hybrid | fpr | 0.1812 | 0.1786 | 0.1389 | 0.2537 | 0.0531 |
| Protein-Term | sens. | 0.9997 | 0.9992 | 0.9995 | 0.9997 | 0.9999 |
| Protein-Term | fpr | 0.3723 | 0.3416 | 0.3927 | 0.4104 | 0.1456 |
| FFC | sens. | 0.9763 | 0.9693 | 0.9573 | 0.9823 | 0.9636 |
| FFC | fpr | 0.4431 | 0.4361 | 0.3492 | 0.5042 | 0.4164 |

References 

1. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.

2. Hakes L, Lovell SC, Oliver SG, Robertson DL (2007) Specificity in protein interactions and its relationship with sequence diversity and coevolution. PNAS 104: 7999-8004.

3. Hartwell LH, Hopfield JJ, Leibler S, Murray AW (1999) From molecular to modular cell biology. Nature 402: C47-C52.

4. Punta M, Ofran Y (2008) The rough guide to in silico function prediction, or how to use sequence and structure information to predict protein function. PLoS Comput Biol 4: e1000160.

5. Yu GX, Glass EM, Karonis NT, Maltsev N (2005) Knowledge-based voting algorithm for automated protein functional annotation. PROTEINS: Structure, Function, and Bioinformatics 61: 907-917.

6. The gene ontology consortium (2000) Gene ontology: Tool for the unification of biology. Nature Genetics 25: 25-29.

7. Brohée S, van Helden J (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7: 48.

8. Barabasi A, Oltvai Z (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5: 101-113.

9. Arnau V, Mars S, Marin I (2005) Iterative cluster analysis of protein interaction data. Bioinformatics 21: 364-378.

10. Rives A, Galitski T (2003) Modular organization of cellular networks. PNAS 100: 1128-1133.

11. Friedel C, Zimmer R (2006) Inferring topology from clustering coefficients in protein-protein interaction networks. BMC Bioinformatics 7: 519.

12. Dunn R, Dudbridge F, Sanderson C (2005) The use of edge-betweenness clustering to investigate biological function in pins. BMC Bioinformatics 6: 39.

13. Luo F, Yang Y, Chen CF, Chang R, Zhou J, et al. (2007) Modular organization of protein interaction networks. Bioinformatics 23: 207-214.

14. Newman M, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys 69: 026113.

15. Chen J, Yuan B (2006) Detecting functional modules in the yeast protein-protein interaction network. Bioinformatics 18: 2283-2290.

16. Asur S, Ucar D, Parthasarathy S (2007) An ensemble framework for clustering protein-protein interaction networks. Bioinformatics 23: i29-i40.

17. Enright AJ, Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575-84.

18. King AD, Przulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20: 3013-20.

19. Blatt M, Wiseman S, Domany E (1996) Superparamagnetic clustering of data. Phys Rev Lett 76: 3251-54.

20. Bader G, Hogue C (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 4.

21. Spirin V, Mirny LA (2003) Protein complexes and functional modules in molecular networks. PNAS 100: 12123-8.

22. Przulj N, Wigle DA, Jurisica I (2004) Functional topology in a network of protein interactions. Bioinformatics 20: 340-348.

23. Sen T, Kloczkowski A, Jernigan R (2006) Functional clustering of yeast proteins from the protein-protein interaction network. BMC Bioinformatics 7: 355.

24. Mukhopadhyay A, Ray S, De M (2012) Detecting protein complexes in ppi network: A gene ontology-based multiobjective evolutionary approach. Molecular BioSystems, Royal Society of Chemistry 8: 3036-3048.

25. Zhang Y, Lin H, Yang Z, Wang J, Li Y, et al. (2013) Protein complex prediction in large ontology attributed protein-protein interaction networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 728-741.

26. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627.

27. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98: 4569-4574.

28. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.



Supporting Information 

File S1 Matlab code for generation of the graph
representations.

(ZIP)

Author Contributions 

Wrote the paper: KT AB LK. Designed the research: KT AB LK. 
Developed numerical tools and performed simulations: KT AB. Discussed 
results and reviewed the manuscript: KT AB LK. 



29. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, et al. (2006) Global landscape 
of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440: 637- 
643. 

30. Gavin AG, Aloy P, Grandi P, Krausc R, Boeschc M, ct al. (2006) Protcome 
survey reveals modularity of the yeast cell machinery. Nature 440: 631-636. 

31. Salwinski L, Miller CS, Smith AJ, Pettit EK, Bowie JU, et al. (2004) The 
Database of Interacting Proteins: 2004 update. Nucleic Acids Res 32: D449- 
451. 

32. Giildener U, Miinsterkotter M, Oesterheld M, Pagel P, Rucpp A, ct al. (2006) 
MPact: the MIPS protein interaction resource on yeast. Nucleic Acid Research 
34: D436-41. 

33. Chatr-aryamontri A, Ceol A, Palazzi LM, Nardelli G, Schneider MV, et al. 
(2007) MINT: the Molecular INTcraction database. Nucleic Acids Res 35: 
D572-574. 

34. Bader GD, Hogue GWV (2000) BIND a data specification for storing and 
describing biomolccular interactions, molecular complexes and pathways. 
Bioinformatics 16: 465-477. 

35. Breitkreutz BJ, Stark C, Tyers M (2003) The GRID: The General Repository for 
Interaction Datasets. Genome Biology 4: R23. 

36. Dwight SS, Harris MA, Dolinski K, Ball GA, Binkley G, et al. (2002) 
Saccharomyccs Genome Database (SGD) provides secondary gene annotation 
using the Gene Ontology (GO). Nucleic Acids Res 30: 69-72. 

37. Letovsky S, Kasif S (2003) Predicting protein function from protein/protein 
interaction data: a probabilistic approach. Bioinformatics 19 Suppl 1: il97— 1204. 

38. Blocked H, Rahmani H, Witscnburg T (2010) On the importance of similarity 
measures for planning to learn. In: 19th European Conference on Artificial 
Intelligence, 3rd Planning to Learn workshop, PlanLcarn-20 1 0. International 
Workshop on Planning to Learn, pp. 69-74. 

39. Fortunato S (2010) Community detection in graphs. Physics Reports 486: 75- 
174. 

40. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a 
comparative analysis. Phys Rev E Stat Nonlin Soft Matter Phys 80: 056117. 

41. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in 
networks. Phys Rev E Stat Nonlin Soft Matter Phys 69: 026113. 

42. Newman MEJ (2004) Analysis of weighted networks. Phys Rev E Stat Nonlin 
Soft Matter Phys 70: 056131. 

43. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very 
large networks. Phys Rev E Stat Nonlin Soft Matter Phys 70: 066111. 

44. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of 
communities in large networks. Journal of Statistical Mechanics: Theory and 
Experiment 2008: P10008. 

45. Fortunato S, Barthelemy M (2007) Resolution limit in community detection. 
Proc Natl Acad Sci USA 104: 36-41. 

46. Lambiotte R (2010) Multi-scale modularity in complex networks. In: WiOpt. pp. 
546-553. 

47. Reichardt J, Bornholdt S (2006) Statistical mechanics of community detection. 
Phys Rev E Stat Nonlin Soft Matter Phys 74: 016110. 

48. Lambiotte R, Delvenne J, Barahona M (2009) Laplacian dynamics and 
multiscale modular structure in networks. Available: http://arxiv.org/abs/0812.1770. 
arXiv:0812.1770. 

49. Ahn YY, Bagrow JP, Lehmann S (2010) Link communities reveal multiscale 
complexity in networks. Nature 466: 761-764. 

50. Evans TS, Lambiotte R (2009) Line graphs, link partitions, and overlapping 
communities. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics) 
80: 016105. 

51. Evans TS, Lambiotte R (2010) Line graphs of weighted networks for overlapping 
communities. The European Physical Journal B - Condensed Matter and 
Complex Systems 77: 265-272. 

52. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks 
reveal community structure. Proc Natl Acad Sci USA 105: 1118-1123. 

53. Schwikowski B, Uetz P, Fields S (2000) A network of protein-protein interactions 
in yeast. Nat Biotechnol 18: 1257-1261. 

54. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing 
community detection algorithms. Phys Rev E Stat Nonlin Soft Matter Phys 78: 
046110. 






55. Danon L, Guilera AD, Duch J, Arenas A (2005) Comparing community 
structure identification. Journal of Statistical Mechanics: Theory and 
Experiment 2005: P09008-09008. 

56. Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and 
hierarchical community structure of complex networks. New Journal of Physics 
11: 033015. 



57. Dong D, Bing Z, Han JDJ (2007) Comparing the biological coherence of 
network clusters identified by different detection algorithms. Chinese Science 
Bulletin 21: 2938-2944. 

58. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognition Letters 
27: 861-874. 

59. Hart GT, Ramani AK, Marcotte EM (2006) How complete are current yeast 
and human protein-interaction networks? Genome Biol 7: 120. 



PLOS ONE | www.plosone.org 






June 2014 | Volume 9 | Issue 6 | e99755 



